Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge with upstream #271

Merged
merged 58 commits into from
Dec 20, 2024
Merged

Merge with upstream #271

merged 58 commits into from
Dec 20, 2024

Conversation

dhil
Copy link
Member

@dhil dhil commented Dec 20, 2024

No description provided.

alexcrichton and others added 30 commits December 12, 2024 14:08
…ance#9794)

This commit first fixed an issue with table access codegen to disable
spectre mitigations on Pulley targets like how spectre is disabled for
memory accesses as well. This unblocked many tests related to tables
which then led to a different error about a `trapnz` with an 8-bit value
not being supported.

In fixing `trapnz` with 8-bit values this PR went ahead and did a
general-purpose refactoring for how conditional branches are managed.
Previously conditional traps and conditional branches had some
duplicated logic and the goal was to unify everything. There is now a
single `Cond` which represents the condition of a conditional jump which
is used uniformly for all locations such as `select`, `brif`, and
`trap[n]z`. This new type represents all the sorts of conditional
branches that can be done in Pulley, for example integer comparisons and
whether or not a register is zero. This `Cond` type has various helpers
for printing it, inverting it, collecting operands, emission, etc.

The end result is that it's a bit wordy to work with `Cond` right now
due to the size of the variants but all locations working with
conditional traps are deduplicated and now it's just repetitive logic
rather than duplicated logic.

Putting all of this together gets a large batch of spec tests working.

I'll note that this does remove a feature where `trapnz` was turned into
nothing or an unconditional trap if the argument was a constant, but
that feels like an optimization perhaps best left for the middle-end
rather than doing it in the backend.

cc bytecodealliance#9783
Implement float-to-int bitcasts as well as the xor operation for
integers.

cc bytecodealliance#9783
)

* cranelift: Round inline stack probes down, not up

When we have `enable_probestack` turned on and set `probestack_strategy`
to "inline", we have to compute how many pages of the stack we'll probe.

The current implementation rounds our stack frame size up to the nearest
multiple of the page size, then probes each page once.

However, if our stack frame is not a multiple of the page size, that
means there's a partial page at the end. It's not necessary to probe
that partial page, just like it's unnecessary to probe at all if the
frame is smaller than one page. Either way, any signal handler needs to
be prepared for stack accesses on that last page to fault at any time
during the function's execution.

* Add comments explaining why we round down.

* Port the round-down code to s390x too.

* Update s390x expected outputs.

---------

Co-authored-by: Dan Gohman <dev@sunfishcode.online>
* Use the C API as the example instead of the `wasmtime` CLI now that
  Cranelift can be disabled.
* Update custom platform docs to talk about `signals-based-traps`.
* pulley: Get `strings.wast` test passing

Needed the `imul` CLIF instruction to get implemented so 32/64-bit
multiplication have now been added.

cc bytecodealliance#9783

* Flag test as now passing
…ance#9803)

* pulley: Fill out remaining 32/64-bit integer operations

This required some extra plumbing to shepherd the precise reason why
signed division trapped to Wasmtime which is done through an extra
`TrapKind` side channel now added.

This then additionally fixes the signed remainder interpreter function
to return 0 on `MIN % -1` which is different from what Rust specifies
(which is to return `None` or panic).

cc bytecodealliance#9783

* Reduce some code duplication

* Fix pulley tests

* Fix MSRV compat
* pulley: Implement float<->int conversions

Gets the `conversions.wast` test running along with a few other misc
ones.

cc bytecodealliance#9783

* Fix pulley's no_std build

* One more conversion to a workspace dep
* Update to WASI 0.2.3 WIT files

No major changes here; this just updates WIT files from WASI 0.2.2 to 0.2.3.

* Update the vendor-wit.sh script to 0.2.3.
* pulley: Implement fcopysign for issue4890.wast

* Fix #[no_std] issue + feedback
* pulley: Move `fp`/`lr` out of `XReg` set

This commit moves the `fp` and `lr` registers out of the `XReg` register
set and into the `MachineState` per-VM. These are automatically modified
and read with `push_frame` and `pop_frame`. Dedicated `xmov_{fp,lr}`
instructions were added for use in Wasmtime's trampolines which directly
read these registers.

* Fix pulley tests

* Free up `spilltmp1` register

Also unused in CLIF
* pulley: Implement float arithmetic operations

Fill out enough to get `f32.wast` and `f64.wast` spec tests working. A
minor ABI issue was discovered along the way which is also required to
get a new test working on both 32 and 64-bit platforms.

cc bytecodealliance#9783

* Centralize handling of float operations

Add a new `wasmtime-math` crate which is tasked with providing the float
math functions needed by Wasm with with, in theory, most optimal
platform implementation available.

* Fix riscv64 tests
* pulley: Get `call_indirect.wast` spec test working

Fix a few typos here and there related to floats to get some tests
passing.

cc bytecodealliance#9783

* Update CLIF test expectations
This hasn't actually turned up anything in quite some time and MIRI
testing is relatively slow so by default don't test MIRI on Pulley PRs
(it's of course still tested on the merge queue though)
Fill out a lowering for CLIF's `ineg` instruction.

cc bytecodealliance#9783
Gets some `embenchen_*.wast` tests passing.

cc bytecodealliance#9783
…9818)

Forcibly disable the `threads` proposal for Pulley. This is done because
I don't believe there's any way that we can, in Rust, implement
non-atomic loads and stores in a manner that isn't undefined behavior.
Until this is available leave this proposal as flagged as unsupported.
* pulley: Initial scaffold of SIMD support

This commit fills out some of the initial infrastructure necessary for
supporting the SIMD proposal to WebAssembly in the Pulley interpreter,
namely 128-bit simd. The `VRegVal` union has been filled out with
various types, endianness questions are settled, and initial
implementations of a suite of opcodes are added to get a basic set of
tests working throughout the backend.

cc bytecodealliance#9783

* Avoid dealing with big-endian vectors

* Change wasm `global`s to store `v128` in little-endian format.
* Change pulley stack loads/stores to work with vectors in little-endian
  format.
This commit adds the pulley targets to many of the preexisting `*.clif`
runtests throughout the tree. This covers most of the MVP functionality
of wasm for example and additionally exercises 8 and 16-bit lowerings
for many instructions. A few minor pulley instructions were added and
otherwise new 8/16-bit lowerings use existing instructions. It's
expected that the 8/16-bit lowerings won't be used all that often so
they're not particularly optimal at this time.

Some CLIF tests were omitted such as:

* Most SIMD-using CLIF tests
* Float/int conversion tests using 8 and 16-bit integers
* Tests with `call` instructions as relocations don't work with the JIT
  crate on Pulley
* Tests using 128-bit integers

Support for some of these tests may be enabled in the future, but for
example 8/16-bit integers may not get used much.
…liance#9829)

* Add an inheritance test

* Only allow GCing tags we (mostly) know are safe to GC

Make the conservatively correct choice the default one.
…liance#9824)

It seems that this was removed in bytecodealliance#9381

The new method is really nice for controlling this at runtime, however the docs still mentioned the old behavior.
This is not directly reachable from wasm but can be created through
optimizations.
* pulley: Implement `return_call` instructions

This commit fleshes out the Cranelift lowerings of tail calls which gets
the wasm tail call proposal itself working on Pulley. Most of the bits
and pieces here were copied over from the riscv64 backend and then
edited to suit Pulley.

* Fix table64 addressing on 32-bit
…liance#9821)

* allow customizing log prefixes for wasmtime serve command

Signed-off-by: Joseph Zhang <jzhang@absolute.com>

* pr feedback - use simple boolean flag instead

---------

Signed-off-by: Joseph Zhang <jzhang@absolute.com>
* pulley: Implement SIMD `splat` instruction

Gets a few spec tests and CLIF tests passing

cc bytecodealliance#9783

* Fix typo
Fill out vector load-and-extend instructions.
alexcrichton and others added 28 commits December 17, 2024 16:33
Fill out some bit-related operations for vector registers.
Fill out some bitmask/test instructions for vectors.
* emit 32bit udiv

* winch: aarch64 udiv/urem without extension

* remove stray dbg!

* fmt

* remove println

* fix formatting in ISLE

* Sized TrapIf

* move operand size into CondBrKind variant

* show_reg_sized fallback
* pulley: Get `simd_conversions.wast` test working

Lots of narrowing/extending/conversion-related opcodes implemented. Note
that these opcodes are all in the "extended" namespace as the 1-byte
namespace has started to overflow.

* Fill out some TODO
* Move memories first in `VMContext`

This commit shuffles the fields of `VMContext` to move memories first
above all other items. This is done to ensure that the offset to
memory-related fields is one of the smaller offsets and ideally fits within an
8-byte offset which helps fit into some special cases on various ISAs.
For example x64 has a smaller instruction encoding for 8-bit offsets
than 32-bit offsets, and I'm planning on doing a similar optimization
for Pulley.

* Review comments
* pulley: Add basic i128 comparison to backend

Don't add special opcodes yet as this probably isn't performance
critical in an interpreted context, but it should always be possible to
add such opcodes in the future.

* Review comments
…e#9845)

* Optimize constant offset loads/stores in translation

This commit implements an optimization when lowering Wasm bytecode to
CLIF to skip a bounds check when the offset in memory is statically
known. This comes up in C/C++/Rust when `static` memory is addressed for
example where LLVM emits an `i32.const 0` as the base address and puts
the address of the variable in the `offset` instruction field. This
isn't necessary for 64-bit platforms but when explicit bounds-checks are
required this can help to eliminate a constant-true bounds-check.

* Review comments
…9852)

No new pulley opcodes here, just reusing preexisting opcodes.
Pushes more into the 32-bit offset on load/store instructions.
…#9854)

* pulley: Shuffle opcodes to free some 1-byte opcodes

In the near future the set of opcodes here are going to be expanded
along a number of axes such as:

* More modes of addressing loads/stores other than just `reg + offset32`.
* Opcodes with immediate operands rather than unconditional register operands.

The 1-byte opcode namespace is already filling up and there's a bit of a
mishmash of what's 1-byte and what's "extended" for now. To help bridge
this gap in the interim shuffle all float/vector-related opcodes into
the "extended" opcode namespace. This frees up a large chunk of the
1-byte opcode namespace for future expansion with extensions like above.

I'll note that I'm not 100% sure that the opcodes will all stay here
after this reshuffling. I haven't done any profiling/performance
analysis to gauge the impact of this change. The immediate goal is to
start experimenting with non-float/vector programs and get them
profiling well. This will require more "macro opcodes" such as new
addressing modes and opcodes-with-immediates. These are expected to be
relatively hot opcodes and thus probably want to be in the "fast" 1-byte
namespace, hence the shuffling here.

My plan is to in the future do a bit of an evaluation with a
float/vector program and see whether it make sense to shuffle some of
them into this 1-bytecode space as well. More radically it might make
sense to remove the split between ops/extended ops and instead just have
a 2-byte opcode space entirely. That's all left for future evaluations
though.

* Fix test offsets

* Update test expectations
This commit extends the pulley opcode space with integer
addition/subtraction where `src2` is an immediate. The goal here is to
be a "sort of macro instruction" despite it not being too too macro
here. This cuts down on `xconstN` instructions which both saves space in
the final binary and should be slightly more optimal perf-wise due to
only one dispatch being needed.

In this commit the `xadd32` instruction is previously 3 bytes: one for
an opcode and 2 bytes for the dst/src1/src2 binary operands. Adding a
small constant to a register previously took 5 bytes where 2 bytes were
needed for `xconst8 N` then 3 for the addition. Here the encoding size
of the new instruction is 4 bytes: 1 for the opcode, 2 for dst/src1, and
one for the immediate. This is currently chosen to mostly optimize
dispatch in the interpreter loop as opposed to code size (as only a
single byte is saved). In the future thought it would be possible to
extend `BinaryOperands` to one operand being a 6-bit immediate to
preserve the same code size.

This also notably adds, for addition/subtraction, only unsigned
immediates. With addition/subtraction being inverses of one another
supporting signed immediates isn't necessary and helps free up another
bit for packing numbers into these opcodes.

This change reduces the size of `spidermonkey.cwasm` from 31M to 29M
locally.
* add Default and Debug impls to SparseMap

* more meaningful debug impl
This commit extends the set of opcodes to load/stores from memory with
integer registers. Previously the only addressing mode supported was a
base register plus a 32-bit signed immediate. This immediate frequently
doesn't need 32-bits though and can often fit in a much smaller range.
Looking at `spidermonkey.cwasm` a large number of loads/stores can fit
within an unsigned 8-bit integer instead so this commit adds an
`offset8` mode in addition to the preexisting `offset32` mode.
Empirically this commit shrinks `spidermonkey.cwasm` for pulley64 from
33M to 31M.

This notably, at this time, does not extend general addressing modes in
Pulley nor does it extend all loads/stores. For example
float/vector/big-endian loads and stores all continue to only support a
32-bit signed offset from the base pointer. This is done under the
assumption that integer loads/stores dominate both
performance/code-size, but this is not empirically proven just yet.

Additionally at this time the choice is being made to add an
opcode-per-addressing-mode rather than having a single load opcode take
a general addressing mode. The assumption here is that decoding a fully
general addressing mode and processing it is probably slower at runtime
than specializing opcodes per addressing mode. This is currently an
unproven assumption however and the cost of this is increased complexity
in the Cranelift backend as it has to have many branches for all
loads/stores supported.
…ce#9862)

* pulley: Remove unwind metadata from Cranelift backend

This is a copy/paste from the riscv64 backend originally but there's no
need to integrate with native unwinders on Pulley so there's no need to
track this information. This removes the `Unwind` pseudo-inst entirely.

* Fix CI and review comments
* pulley: Add immediate payloads to more opcodes

This commit adds immediate payloads to the following instructions:

* `xmul32` - `xmul32_s8` / `xmul32_s32`
* `xmul64` - `xmul64_s8` / `xmul64_s32`
* `xband32` - `xband32_s8` / `xband32_s32`
* `xband64` - `xband64_s8` / `xband64_s32`
* `xbor32` - `xbor32_s8` / `xbor32_s32`
* `xbor64` - `xbor64_s8` / `xbor64_s32`
* `xbxor32` - `xbxor32_s8` / `xbxor32_s32`
* `xbxor64` - `xbxor64_s8` / `xbxor64_s32`
* `xshl32` - `xshl32_u6`
* `xshl64` - `xshl64_u6`
* `xshr32_u` - `xshl32_u_u6`
* `xshr64_u` - `xshl64_u_u6`
* `xshr32_s` - `xshl32_s_u6`
* `xshr64_s` - `xshl64_s_u6`

For shifts there's no need to have 32-bit immediates (or even 8-bit)
since 6 bits is enough to encode all the immediates. This means that the
6-bit immediate is packed within `BinaryOperands` as a new `U6` type.

This commit unfortunately does not shrink `spidermonkey.cwasm`
significantly beyond the prior 29M. This is nevertheless expected to be
relatively important for performance.

* Fix test expectations
)

This commit adds a large number of new `br_if_x*` instructions which
compare with an immediate instead of comparing two registers. This is
pretty common in wasm/compiled code where, for example, loop upper
bounds are often constants. This helps compress code slightly while
fusing more instructions together.

The main cost of this is that the number of opcodes added here is quite
large. Like with previous immediate-taking opcodes both 8 and 32-bit
variants of immediates are added for all comparisons. Additionally
unlike the previous set of branch-and-compare instructions it's required
to add instructions for `>` and `>=` because the operands cannot be
swapped to invert the condition, further increasing the number of
opcodes added.

This is a mild size reduction on `spidermonkey.cwasm` from 29M to 28M
but it's mostly expected to be a performance win for interpreted loops.
…dealliance#9864)

* pulley: Add macro instructions for function prologue/epilogue

This commit adds two new instructions to Pulley to combine the
operations of setting up a frame, allocating stack, and saving clobbered
registers. This is all combined into a single instruction which is
relatively large but is much smaller than each of these individual
operations exploded out.

This is a size win on `spidermonkey.cwasm` by about 1M and locally in a
small `fib.wat` test this is also a good speedup by reducing the number
of instructions executed.

* Review comments and update test expectations
Filling out some more miscellaneous `*.wast` tests
* Test DWARF with --release

This would have caught the issue in CI.

It is also closer to what the end user will use.

* [prtest:debug] Use volatile for keep-alive

* Drop __attribute__((retain))

* Run tests in debug as well

* Remove debug testing; add comment
* rename put_non_zero_in_reg to put_nonzero_in_reg_maybe_zext

* handle 32bit sdiv

* i32 rem_u/s

* fix tests

* fix rem

* fmt

* improve load_constant_full

* fix tests, and review edits
This commit fixes a few mistakes that were introduced in bytecodealliance#9863.
Specifically when lowering `Cond.If...` and the arguments needed
swapping the condition was also inverted by accident. More `*.clif`
runtests were added to catch this case and expose it. Additionally
Pulley now has lowering for all the `FloatCC` orderings to be able to
run the `select.clif` runtest which primarily exposed the issue.
* pulley: Implement some float simd ops

Gets a few more wast tests passing

* Enable some cranelift runtests
Refactor some existing usage of `pulley_xconst*` instructions to instead
use instructions-taking-immediates instead now that they've been added
to Pulley.
@dhil dhil merged commit bd991b2 into wasmfx:main Dec 20, 2024
43 checks passed
@dhil dhil deleted the wasmfx-merge branch December 20, 2024 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.